Comparing commercial and open-source TTS models across quality, latency, and cost
Generated: 2026-01-28 16:49
Azure, ElevenLabs, and MiniMax - paid API services with enterprise support
| Provider | Avg Latency | Realtime Factor | Languages |
|---|---|---|---|
| Azure TTS | 262ms | 25.2x | 140 |
| Azure Streaming | 211ms | 32.7x | 140 |
| ElevenLabs Standard | 1124ms | 4.7x | 29 |
| ElevenLabs Turbo | 239ms | 18.4x | 29 |
| MiniMax | 4550ms | 1.4x | 11 |
| MiniMax Streaming | 1601ms | 4.5x | 11 |
| MiniMax PCM | 1813ms | 4.0x | 11 |
Listen and compare voice quality across commercial providers. Per-sentence generation latency (providers in the same order as the summary table above):

Test sentences:
1. "Hello, how can I help you today?"
2. "Dr. Smith's API at api.example.com returns JSON for the Q4 NYSE report."
3. "Your balance is $12,847.53, payment of $299.99 due January 15th, 2026."
4. "欢迎使用Kira智能助手,我是您的AI客服,请问有什么可以帮您?" ("Welcome to the Kira smart assistant, I'm your AI customer service agent, how can I help you?")
5. "Welcome to Kira智能助手, your AI客服 for 24/7 support."

| Provider | Sentence 1 | Sentence 2 | Sentence 3 | Sentence 4 | Sentence 5 |
|---|---|---|---|---|---|
| Azure TTS | 250ms | 289ms | 281ms | 243ms | 245ms |
| Azure Streaming | 207ms | 240ms | 226ms | 187ms | 197ms |
| ElevenLabs Standard | 833ms | 1216ms | 1318ms | 1191ms | 1062ms |
| ElevenLabs Turbo | 161ms | 267ms | 304ms | 240ms | 225ms |
| MiniMax | 3647ms | 5202ms | 5495ms | 4279ms | 4128ms |
| MiniMax Streaming | 1415ms | 1608ms | 1582ms | 1760ms | 1640ms |
| MiniMax PCM | 1901ms | 1792ms | 1804ms | 1765ms | 1801ms |
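The "Avg Latency" column in the summary table can be reproduced by averaging the per-sentence measurements (values copied from this report):

```python
# Per-sentence latencies in ms, copied from the samples above.
latencies_ms = {
    "Azure TTS":           [250, 289, 281, 243, 245],
    "Azure Streaming":     [207, 240, 226, 187, 197],
    "ElevenLabs Standard": [833, 1216, 1318, 1191, 1062],
    "ElevenLabs Turbo":    [161, 267, 304, 240, 225],
    "MiniMax":             [3647, 5202, 5495, 4279, 4128],
    "MiniMax Streaming":   [1415, 1608, 1582, 1760, 1640],
    "MiniMax PCM":         [1901, 1792, 1804, 1765, 1801],
}

# Average and round to whole milliseconds, as in the summary table.
averages = {name: round(sum(v) / len(v)) for name, v in latencies_ms.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}ms")
```

Each computed average matches the corresponding "Avg Latency" table entry, which confirms the provider ordering of the samples.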
| Provider | 100K chars/mo | 500K chars/mo | 1M chars/mo | Model |
|---|---|---|---|---|
| Azure TTS | $1.60 | $8.00 | $16.00 | en-US-AvaMultilingualNeural, non-streaming |
| Azure Streaming | $1.60 | $8.00 | $16.00 | en-US-AvaMultilingualNeural, streaming |
| ElevenLabs Standard | $16.50 | $82.50 | $165.00 | eleven_multilingual_v2, high quality |
| ElevenLabs Turbo | $16.50 | $82.50 | $165.00 | eleven_turbo_v2_5, low latency |
| MiniMax | $6.00 | $30.00 | $60.00 | speech-2.6-turbo, non-streaming |
| MiniMax Streaming | $6.00 | $30.00 | $60.00 | speech-2.6-turbo, streaming MP3 |
| MiniMax PCM | $6.00 | $30.00 | $60.00 | speech-2.6-turbo, streaming PCM |
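Each pricing row scales linearly with character volume, so it reduces to a single per-1M-character rate. A sketch of the arithmetic, assuming simple linear billing (real plans may add tiers and subscription minimums):

```python
# Per-1M-character rates taken from the pricing table above.
RATE_PER_MILLION = {
    "Azure TTS": 16.00,
    "ElevenLabs": 165.00,
    "MiniMax": 60.00,
}

def monthly_cost(provider: str, chars_per_month: int) -> float:
    """Linear pricing: (characters / 1M) * rate, rounded to cents."""
    return round(RATE_PER_MILLION[provider] * chars_per_month / 1_000_000, 2)

print(monthly_cost("Azure TTS", 500_000))    # matches the $8.00 table entry
print(monthly_cost("ElevenLabs", 100_000))   # matches the $16.50 table entry
```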
Output format comparison for WebSocket/WebRTC integration
| Provider | Mode | Format | Sample Rate | Bit Depth |
|---|---|---|---|---|
| Azure TTS | Non-Streaming | RIFF/WAV PCM | 24 kHz | 16-bit Mono |
| Azure Streaming | Streaming | RIFF/WAV PCM | 24 kHz | 16-bit Mono |
| ElevenLabs Standard | Non-Streaming | Raw PCM | 24 kHz | 16-bit Mono |
| ElevenLabs Turbo | Streaming | Raw PCM | 24 kHz | 16-bit Mono |
| MiniMax | Non-Streaming | WAV | 32 kHz | 16-bit Mono |
| MiniMax Streaming | Streaming (MP3) | MP3 (128kbps) | 32 kHz | Compressed |
| MiniMax PCM | Streaming (PCM) | Raw PCM | 24 kHz | 16-bit Mono |
| Qwen3-TTS | Both | Raw PCM | 24 kHz | 16-bit Mono |
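Several providers in the table return headerless raw PCM. A minimal sketch, using Python's standard wave module, of wrapping 16-bit mono 24 kHz PCM in a RIFF/WAV container so it can be saved or inspected locally (the sample bytes here are synthetic silence):

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap headerless 16-bit mono PCM in a RIFF/WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

# One second of silence at 24 kHz: 24000 frames * 2 bytes each.
wav_bytes = pcm_to_wav(b"\x00\x00" * 24000)
print(wav_bytes[:4])  # b'RIFF'
```

For the 32 kHz MiniMax non-streaming output, pass `sample_rate=32000`; mismatched rates play back pitch-shifted.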
MiniMax Streaming (MP3): Uses lossy MP3 compression. Requires MP3 decoding before WebRTC transmission, adding latency and complexity.
MiniMax PCM: Raw PCM output at 24kHz - directly compatible with WebRTC/WebSocket. No transcoding needed.
Status: the MP3 streaming mode is suitable for testing; the PCM streaming mode is recommended for production use.
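A minimal sketch of why raw PCM is "directly compatible": real-time audio stacks typically send fixed 20 ms frames, so a PCM stream only needs slicing, not decoding. The frame math assumes the 24 kHz 16-bit mono format above; the actual WebSocket/WebRTC send call is framework-specific and omitted:

```python
# 20 ms of 16-bit mono PCM at 24 kHz = 24000 * 0.020 * 2 = 960 bytes.
SAMPLE_RATE = 24000
BYTES_PER_SAMPLE = 2
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 960

def frames(pcm: bytes):
    """Yield fixed-size frames, zero-padding the final partial frame."""
    for i in range(0, len(pcm), FRAME_BYTES):
        chunk = pcm[i:i + FRAME_BYTES]
        if len(chunk) < FRAME_BYTES:
            chunk += b"\x00" * (FRAME_BYTES - len(chunk))
        yield chunk

# 100 ms of audio slices into five 20 ms frames with no transcoding step.
pcm = b"\x01\x02" * (SAMPLE_RATE // 10)
out = list(frames(pcm))
print(len(out), len(out[0]))  # 5 960
```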
- Latency: time to generate audio from text. Lower is better. Under 500ms feels instant, 500-1500ms is acceptable, over 1500ms feels slow.
- Realtime factor: how much faster than playback speed. Example: 15x means a 3-second clip generates in 0.2 seconds. Real-time apps need at least 1x.
- Cost: price per 1 million characters. 1M chars ≈ 250 pages of text or ~170 minutes of speech.
- Languages: total number of languages the provider supports. More languages means broader global coverage.
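The latency buckets and realtime-factor arithmetic above can be sketched as small helpers (the thresholds are the ones stated in this report):

```python
def latency_feel(ms: float) -> str:
    """Bucket generation latency: <500ms instant, 500-1500ms acceptable, else slow."""
    if ms < 500:
        return "instant"
    if ms <= 1500:
        return "acceptable"
    return "slow"

def generation_time(clip_seconds: float, realtime_factor: float) -> float:
    """A 3 s clip at a 15x realtime factor generates in 3 / 15 = 0.2 s."""
    return clip_seconds / realtime_factor

print(latency_feel(211), latency_feel(1124), latency_feel(4550))  # instant acceptable slow
print(generation_time(3, 15))                                     # 0.2
```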
Key Insight: Streaming is a delivery optimization, not a quality setting. The audio quality is identical - only the timing differs.
Note: Audio durations may vary slightly between runs because TTS synthesis is non-deterministic and each API call generates audio independently.
Both modes use the exact same voice synthesis model and generation algorithm; only the delivery differs.
- Non-streaming: wait for the complete audio, then return it all at once.
- Streaming: return audio in chunks as it's generated.
Streaming lets playback start sooner (e.g., 180ms vs 260ms), reducing perceived wait time. It's like streaming vs downloading a video - the quality is identical, you just start watching sooner.
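A toy simulation of that delivery difference, with sleeps standing in for synthesis time (durations here are illustrative, not measured from any provider):

```python
import time

def synthesize_chunks(n_chunks: int = 5, chunk_time: float = 0.05):
    """Pretend synthesis: each audio chunk takes chunk_time seconds to produce."""
    for _ in range(n_chunks):
        time.sleep(chunk_time)
        yield b"\x00" * 960

def non_streaming_ttfb() -> float:
    """Time to first byte when waiting for the complete audio."""
    start = time.perf_counter()
    b"".join(synthesize_chunks())      # block until everything is generated
    return time.perf_counter() - start

def streaming_ttfb() -> float:
    """Time to first byte when consuming chunks as they arrive."""
    start = time.perf_counter()
    next(synthesize_chunks())          # only wait for the first chunk
    return time.perf_counter() - start

print(f"non-streaming TTFB ~{non_streaming_ttfb():.2f}s, "
      f"streaming TTFB ~{streaming_ttfb():.2f}s")
```

The generated audio is byte-for-byte the same either way; only the time to first byte changes.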
TTS models don't produce identical output every time. Each API call generates audio independently with slight variations in pacing.
ElevenLabs Standard vs Turbo use completely different models with different speaking rates, causing noticeable duration differences.
Small duration differences (±10%) between runs are normal. The content and quality remain consistent.
Why only ~20% difference? Azure is already highly optimized with global infrastructure.
60+ regions worldwide (service availability in 140+ countries). Requests are served from the nearest data center, minimizing network latency.
Base latency is already low (~260ms). Streaming improves by ~50ms - noticeable but not dramatic.
For short utterances, total generation time is minimal. Streaming benefit increases with longer text.
Compare: MiniMax streaming saves 65%, ElevenLabs Turbo saves 79%. Azure saves 20% because it's already optimized.
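The savings figures can be reproduced from the average latencies in the summary table:

```python
def savings_pct(non_streaming_ms: float, streaming_ms: float) -> float:
    """Percent reduction in average latency from streaming delivery."""
    return round(100 * (non_streaming_ms - streaming_ms) / non_streaming_ms, 1)

print(savings_pct(262, 211))    # Azure: ~20%
print(savings_pct(4550, 1601))  # MiniMax: ~65%
print(savings_pct(1124, 239))   # ElevenLabs Standard -> Turbo: ~79%
```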
Two models available:
- Standard (eleven_multilingual_v2): ~1100ms. Premium quality with advanced neural models; best for pre-generated content where latency isn't critical.
- Turbo (eleven_turbo_v2_5): ~250ms. Near-Azure speed with slightly lower quality, optimized for real-time streaming. Quality is still excellent - most users won't notice the difference.

Endpoints:
- api.elevenlabs.io → US only
- api-global-preview.elevenlabs.io → auto-routes to the closest region (US/EU/Singapore)
Use Turbo for real-time applications. Use Standard only when quality is paramount and latency is acceptable.
Key Factor: MiniMax is a Chinese company. We tested using api.minimax.io (international endpoint), which still routes to China servers.
Note: MiniMax also offers api.minimax.chat (China domestic endpoint) which may have lower latency for users in China.
Both endpoints (api.minimax.io and api.minimax.chat) route to servers in mainland China. MiniMax has no data centers outside China.
The .io domain is just a gateway for overseas access. Requests still travel to China and back, adding 200-400ms+ network latency.
Unlike Azure (60+ global regions) or ElevenLabs (US/EU/Singapore servers), MiniMax lacks distributed infrastructure.
~4550ms → ~1600ms TTFB (65% faster). Always use streaming mode for MiniMax to reduce perceived wait time.
- Azure: best overall - lowest latency (~210ms), 140 languages, $16/1M chars. Ideal for real-time global applications.
- ElevenLabs: best quality - use Turbo (~240ms) for real-time, Standard (~1100ms) for premium quality. $165/1M chars.
- MiniMax: best for Chinese - native Mandarin support. High latency internationally (~1600ms); recommended for users in China.
Qwen3-TTS and LuxTTS - free/low-cost alternatives with self-hosting options
| Provider | Avg Latency | Realtime Factor | Languages |
|---|---|---|---|
| LuxTTS | 2163ms | 2.6x | 1 |
Listen and compare voice quality across open-source providers. Per-sentence generation latency:

| Sentence | LuxTTS Latency |
|---|---|
| "Hello, how can I help you today?" | 917ms |
| "Dr. Smith's API at api.example.com returns JSON for the Q4 NYSE report." | 2164ms |
| "Your balance is $12,847.53, payment of $299.99 due January 15th, 2026." | 3407ms |
| Provider | 100K chars/mo | 500K chars/mo | 1M chars/mo | Model |
|---|---|---|---|---|
| Qwen3-TTS | $1.00 | $5.00 | $10.00 | qwen3-tts-flash, non-streaming |
| Qwen3-TTS Streaming | $1.00 | $5.00 | $10.00 | qwen3-tts-flash-realtime, streaming |
| LuxTTS | $0.00 | $0.00 | $0.00 | Local CPU - no API cost |